Skip to content

test: add real runAgentInSandbox() E2E tests for telegram injection#1267

Closed
jyaunches wants to merge 2 commits intoNVIDIA:mainfrom
jyaunches:test/telegram-injection-e2e
Closed

test: add real runAgentInSandbox() E2E tests for telegram injection#1267
jyaunches wants to merge 2 commits intoNVIDIA:mainfrom
jyaunches:test/telegram-injection-e2e

Conversation

@jyaunches
Copy link
Copy Markdown
Contributor

@jyaunches jyaunches commented Apr 1, 2026

Summary

Add Phases 7-11 to test-telegram-injection.sh addressing PR #1092 feedback from @cv: exercise real production code paths instead of ad-hoc SSH commands.

Depends on #1266 (Brev E2E infra hardening) — merge that first.

Split from #1219 to keep test logic separate from infrastructure changes.

Changes

test/e2e/test-telegram-injection.sh

New test phases:

  • Phase 7: Real runAgentInSandbox() with $(command) and backtick injection payloads (T9-T11)
  • Phase 8: Multi-line message handling with embedded injection (T12-T13)
  • Phase 9: Session ID option injection prevention via sanitizeSessionId, including negative Telegram chat IDs (T14-T15)
  • Phase 10: Empty and special-character-only session IDs (T16-T17)
  • Phase 11: Cleanup of marker files and temp scripts

These tests are expected to FAIL on unfixed code (runAgentInSandbox is not exported, signature doesn't accept options), proving they detect the vulnerability. They will pass once PR #119 is merged.

Testing

Type of Change

  • Code change for a new feature, bug fix, or refactor.
  • Code change with doc updates.
  • Doc only.

Refs: #118, PR #1092

Summary by CodeRabbit

Release Notes

  • New Features

    • Improved GPU detection with more robust validation during setup
    • Added retry logic for Node.js and Docker installations with increased visibility into the process
  • Bug Fixes

    • Fixed Docker runtime configuration on CPU-only systems to prevent GPU-related errors
    • Resolved apt/dpkg lock contention issues during initial setup
  • Tests

    • Expanded security testing coverage for environment variable injection prevention
    • Added comprehensive edge-case validation for session ID handling

Shared E2E infrastructure improvements that make Brev-based CI reliable
on CPU-only instances:

brev-setup.sh:
- Extract HAS_GPU flag for consistent GPU detection — nvidia-smi must
  run successfully, not just exist on PATH (Brev GPU images ship the
  binary even on CPU instances)
- All GPU gates (container toolkit, Docker runtime reset, vLLM) use
  the single HAS_GPU flag
- Replace cloud-init || true with proper error check + warning
- Add timeout (300s) and quiet window (5s) to apt wait loop to prevent
  indefinite hangs and races
- Add retry loops (5 attempts, 30s backoff) for Node.js and Docker
  apt-get installs
- Unsuppress NodeSource installer output for debugging
- Reset Docker default runtime to runc on CPU-only instances where
  Brev pre-configures nvidia as default

setup.sh:
- Check nvidia-smi exit code instead of just PATH presence for GPU
  gateway flag

brev-e2e.test.js:
- Use known GCP instance type (n2-standard-4) instead of cpu search
- Add shellQuote() for safe secret interpolation in SSH commands
- Delete leftover instances before creating new ones
- Wait for cloud-init before running bootstrap
Add Phases 7-11 to test-telegram-injection.sh addressing PR NVIDIA#1092
feedback from @cv: exercise real production code paths instead of
ad-hoc SSH commands.

New test phases:
- Phase 7: Real runAgentInSandbox() with $(command) and backtick
  injection payloads (T9-T11)
- Phase 8: Multi-line message handling with embedded injection (T12-T13)
- Phase 9: Session ID option injection prevention via sanitizeSessionId,
  including negative Telegram chat IDs (T14-T15)
- Phase 10: Empty and special-character-only session IDs (T16-T17)
- Phase 11: Cleanup of marker files and temp scripts

These tests are expected to FAIL on unfixed code (runAgentInSandbox is
not exported, signature doesn't accept options), proving they detect the
vulnerability. They will pass once PR NVIDIA#119 is merged.

Refs: NVIDIA#118, PR NVIDIA#1092
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 1, 2026

📝 Walkthrough

Walkthrough

Scripts across provisioning, setup, and testing were updated. GPU detection now validates successful execution rather than command presence. Apt access is serialized with cloud-init synchronization, background service termination, process killing, lock clearing, and idle-loop verification. Node.js and Docker installations use retry logic with backoff. E2E tests now specify instance types directly and await cloud-init completion. Telegram injection tests expand coverage with sandboxing and session ID sanitization validation.

Changes

Cohort / File(s) Summary
GPU Detection & Routing
scripts/brev-setup.sh, scripts/setup.sh
Changed GPU availability checks from command -v nvidia-smi (PATH check) to direct nvidia-smi execution validation; routes GPU-dependent logic through unified HAS_GPU flag.
Apt Lock & Serialization
scripts/brev-setup.sh
Introduced exclusive apt access phase: waits for cloud-init completion, stops/disables background apt timers, force-kills apt/dpkg processes, clears locks, configures dpkg, then loops until idle with max timeout enforcement.
Installation Retry Logic
scripts/brev-setup.sh
Updated Node.js and Docker installations to retry up to 5 times with 30s backoff; removed output suppression to /dev/null for visibility.
Docker Runtime Configuration
scripts/brev-setup.sh
On CPU-only systems, removes "default-runtime": "nvidia" from Docker daemon config when GPU is unavailable and restarts Docker to use default runc runtime.
E2E Provisioning
test/e2e/brev-e2e.test.js
Replaced numeric CPU/RAM constraints with single instance type override (BREV_INSTANCE_TYPE); added pre-provisioning cleanup, changed environment variable quoting strategy, and inserted cloud-init synchronization before bootstrap.
E2E Injection Testing
test/e2e/test-telegram-injection.sh
Expanded test coverage (Phases 7–10) with injection prevention validation via sandbox markers, process argument inspection, session ID sanitization testing, and edge case handling for empty/special-character-only session IDs.

Sequence Diagram(s)

sequenceDiagram
    actor Script as Brev Setup<br/>Script
    participant CloudInit as cloud-init
    participant SystemD as systemd<br/>(timers)
    participant AptDpkg as apt/dpkg<br/>Processes
    participant Locks as Lock<br/>Files
    participant DpkgConfig as dpkg<br/>--configure

    Script->>CloudInit: Wait for cloud-init<br/>status --wait
    CloudInit-->>Script: Ready
    
    Script->>SystemD: Stop/disable apt-daily*<br/>& unattended-upgrades
    SystemD-->>Script: Stopped
    
    Script->>AptDpkg: Force-kill apt-get,<br/>apt, dpkg processes
    AptDpkg-->>Script: Terminated
    
    Script->>Locks: Clear apt/dpkg<br/>lock files
    Locks-->>Script: Cleared
    
    Script->>DpkgConfig: Run dpkg --configure -a
    DpkgConfig-->>Script: Configuration complete
    
    loop Until Idle or Timeout
        Script->>AptDpkg: Check for<br/>running processes
        alt Idle Window Detected
            AptDpkg-->>Script: No activity ✓
            Script->>Script: Exit loop
        else Still Active
            AptDpkg-->>Script: Still running
            Script->>Script: Wait & retry
        end
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 With GPU flags now truly tested,
And apt locks soundly nested,
Retries bounce like hops so keen,
The cleanest setup ever seen!
Sessions safe from injection's play,
Terraform'd tests light the way!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title describes the main addition of real runAgentInSandbox() E2E tests for Telegram injection, which aligns with the primary test file change (385 lines added to test-telegram-injection.sh). However, the PR also includes significant infrastructure hardening in brev-setup.sh and setup.sh that constitutes a substantial portion of the work and changes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

Review ran into problems

🔥 Problems

Timed out fetching pipeline failures after 30000ms


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/e2e/test-telegram-injection.sh (1)

569-595: Consider running the backgrounded command in an explicit subshell for clarity.

The background job at lines 569-571 uses cd "$REPO" && ... which implicitly creates a subshell when backgrounded. While this works, an explicit subshell (cd "$REPO" && ...) would make the intent clearer and ensure consistent behavior across shell versions.

That said, the current implementation is functionally correct—kill $BRIDGE_PID will signal the job leader, and wait $BRIDGE_PID will reap it properly.

🔧 Optional: explicit subshell
-  cd "$REPO" && OPENSHELL_PATH="$OPENSHELL_PATH" \
+  (cd "$REPO" && OPENSHELL_PATH="$OPENSHELL_PATH" \
     NEMOCLAW_SANDBOX_NAME="$SANDBOX_NAME" \
-    timeout 60 node /tmp/test-real-bridge.js 'Hello world' 'e2e-ps-check' &
+    timeout 60 node /tmp/test-real-bridge.js 'Hello world' 'e2e-ps-check') &
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-telegram-injection.sh` around lines 569 - 595, The backgrounded
command that sets OPENSHELL_PATH/NEMOCLAW_SANDBOX_NAME and runs node
/tmp/test-real-bridge.js should be executed in an explicit subshell to make the
intent clear and ensure consistent behavior across shells; wrap the current `cd
"$REPO" && ...` invocation in a subshell (i.e., `(cd "$REPO" &&
OPENSHELL_PATH="..." NEMOCLAW_SANDBOX_NAME="..." timeout 60 node
/tmp/test-real-bridge.js 'Hello world' 'e2e-ps-check') &` so BRIDGE_PID
continues to be captured from `$!` and the environment/working-directory change
is isolated. Ensure BRIDGE_PID is still assigned immediately after backgrounding
and that the later kill $BRIDGE_PID and wait $BRIDGE_PID logic remains
unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/e2e/test-telegram-injection.sh`:
- Around line 569-595: The backgrounded command that sets
OPENSHELL_PATH/NEMOCLAW_SANDBOX_NAME and runs node /tmp/test-real-bridge.js
should be executed in an explicit subshell to make the intent clear and ensure
consistent behavior across shells; wrap the current `cd "$REPO" && ...`
invocation in a subshell (i.e., `(cd "$REPO" && OPENSHELL_PATH="..."
NEMOCLAW_SANDBOX_NAME="..." timeout 60 node /tmp/test-real-bridge.js 'Hello
world' 'e2e-ps-check') &` so BRIDGE_PID continues to be captured from `$!` and
the environment/working-directory change is isolated. Ensure BRIDGE_PID is still
assigned immediately after backgrounding and that the later kill $BRIDGE_PID and
wait $BRIDGE_PID logic remains unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1a1d5163-1d85-4ee7-8e01-e6c247d15bea

📥 Commits

Reviewing files that changed from the base of the PR and between e231d32 and 77e0862.

📒 Files selected for processing (4)
  • scripts/brev-setup.sh
  • scripts/setup.sh
  • test/e2e/brev-e2e.test.js
  • test/e2e/test-telegram-injection.sh

@wscurran wscurran added status: triage For new items that haven't been reviewed yet. NemoClaw CLI Use this label to identify issues with the NemoClaw command-line interface (CLI). enhancement: testing Use this label to identify requests to improve NemoClaw test coverage. and removed status: triage For new items that haven't been reviewed yet. labels Apr 1, 2026
@wscurran
Copy link
Copy Markdown
Contributor

wscurran commented Apr 1, 2026

✨ Thanks for submitting this pull request, which proposes a way to add real runAgentInSandbox() E2E tests for Telegram injection. Possibly related open issues: #1107, #1114, #1109.


Possibly related open issues:

@wscurran wscurran added the Integration: Telegram Use this label to identify Telegram bot integration issues with NemoClaw. label Apr 1, 2026
@jyaunches
Copy link
Copy Markdown
Contributor Author

Closing this PR — per @aerickson's direction, the Telegram bridge is going away entirely, making this injection test work redundant/no longer needed.

@jyaunches jyaunches closed this Apr 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement: testing Use this label to identify requests to improve NemoClaw test coverage. Integration: Telegram Use this label to identify Telegram bot integration issues with NemoClaw. NemoClaw CLI Use this label to identify issues with the NemoClaw command-line interface (CLI).

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants